---
title: "GGPlot Chapter 2"
output: html_notebook
author: "Hannah Gelman"
date: "2020-04-08"
---

Notebook for working through Chapter 2 of the GGPlot book 

```{r}
library(ggplot2) 
library(tidyverse)
```


```{r}
head(mpg)
```
```{r}
names(mpg)
```

#### Exercises, Section 2.2 

3. Convert fuel economy from distance traveled with a fixed amount of fuel (mpg) to fuel consumed over a fixed distance (gpm --> lp100km).
```{r}
mpg_to_lp100km <- function(mpg){
  # litres per US gallon
  lit_p_gal <- 3.785
  # miles per kilometre
  mil_p_km <- 0.6214

  # (gallons per mile) * (litres per gallon) * (miles per km) * 100 km
  lp100km <- 1/mpg * lit_p_gal * mil_p_km * 100
  return(lp100km)
}

mpg_data <- mpg %>% mutate(hwy_lp100km = mpg_to_lp100km(hwy), 
                           cty_lp100km = mpg_to_lp100km(cty))

```
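
As a rough sanity check on the conversion, the whole calculation collapses to a single constant divided by mpg. This is only a sketch: it assumes US gallons and statute miles, and the names `litres_per_gallon`, `km_per_mile` and `lp100km_constant` are made up for illustration.

```{r}
# Assumed conversion factors (US gallon, statute mile)
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# L/100 km = 100 * (litres per gallon) / (km per mile) / mpg
lp100km_constant <- 100 * litres_per_gallon / km_per_mile
lp100km_constant

# Example: fuel consumption of a car rated at 30 mpg
lp100km_constant / 30
```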

4. Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drivetrain? 

Which manufacturer has the most models in this dataset? Toyota 
```{r}
mpg %>% group_by(manufacturer) %>% summarize(nmodels = n_distinct(model)) %>% arrange(-nmodels)
```

Which model has the most variations? jetta
```{r}
mpg %>% group_by(model) %>% summarize(nvar = n(), 
                                      nvar_not_year = n_distinct(displ, cyl, trans, drv)) %>% 
  arrange(-nvar_not_year)
```
What if you remove redundant drivetrain? 

```{r}
# Remove a trailing drivetrain specification from each model name.
# An earlier version did not work as expected because strsplit() returns a
# list (so length() counted strings, not words), `1:n-1` parses as `(1:n) - 1`,
# and if() only tests a single value. Handle each name separately instead:
remove_drv <- function(modelname){
  vapply(strsplit(modelname, " "), function(words) {
    n <- length(words)
    if (n > 1 && words[n] %in% c('4wd', '2wd', 'quattro', 'awd')) {
      words <- words[seq_len(n - 1)]
    }
    paste(words, collapse = " ")
  }, character(1))
}

# Start from mpg_data so the lp100km columns added above are kept
mpg_data <- mpg_data %>% mutate(clean_model = str_trim(str_replace_all(model, c("2wd" = "", "4wd" = "", "awd" = "", "quattro" = ""))))
```
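
Two base R behaviours are easy to trip over when splitting model names by hand, and a small sketch makes them visible. The example strings and the value of `n` below are chosen only for illustration.

```{r}
x <- c("a4 quattro", "jetta")

# strsplit() returns a list with one element per input string,
# so length() counts strings, not words
parts <- strsplit(x, " ")
length(parts)    # number of input strings
lengths(parts)   # number of words in each string

# `:` binds tighter than `-`, so 1:n-1 means (1:n) - 1
n <- 2
1:n - 1          # starts at 0
1:(n - 1)        # the intended index range
```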

Which manufacturer has the most models in this case? Still Toyota, but with fewer models.
```{r}
mpg_data %>% group_by(manufacturer) %>% summarize(nmodels = n_distinct(clean_model)) %>% arrange(-nmodels)
```
Which model has the most variations? a4 
```{r}
mpg_data %>% group_by(clean_model) %>% 
  summarize(nvar = n(), 
            nvar_not_year = n_distinct(displ, cyl, trans, drv)) %>% 
  arrange(-nvar_not_year)

#mpg_data %>% filter(clean_model == 'a4') %>% arrange(drv, trans, cyl)
#by inspection, same config does not appear repeated by year. 
```

#### Exercises, Section 2.3 
1. How would you describe the relationship between cty and hwy? Any concerns about drawing conclusions from this plot? 
Seems like a strong, roughly linear positive relationship, with hwy generally higher than cty. There is less data at higher fuel economy, so it is hard to know how well the relationship holds there. Odd to see "striping"; I wonder how the values were estimated. 

```{r}
ggplot(mpg, aes(hwy, cty)) + 
  geom_point()
```


2. What does ggplot(mpg, aes(model, manufacturer)) + geom_point() show? Is it useful? How could you modify the data to make it more informative?
I have no idea what this is supposed to show, or how it could be useful. 
```{r}
ggplot(mpg, aes(model, manufacturer)) + 
  geom_point()
```

Could instead plot the number of models per manufacturer. 
```{r}
# Grouping by clean_model first has no effect: the second group_by() replaces it
mpg_data %>% group_by(manufacturer) %>% summarize(nmodels = n_distinct(clean_model)) %>% 
  ggplot(aes(manufacturer, nmodels)) +
  geom_col()
```

Describe the data, aesthetic mappings and layers used for each of the following plots. You’ll need to guess a little because you haven’t seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.
-- I made these. 
```{r}
ggplot(mpg, aes(cty)) + geom_histogram()
```

#### Exercises, Section 2.4 
1) Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?

--> Map size to displacement, color to drivetrain 
```{r}
ggplot(mpg, aes(hwy, cty, size = displ, color = drv)) + geom_point()
```

Plot displ against hwy fuel economy, colored by year.
```{r}
ggplot(mpg, aes(displ, hwy, color = as.character(year))) + geom_point()
```

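What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
--> Mapping a continuous variable to shape gives an error: shapes are a discrete, unordered set, so a continuous range cannot be matched to them. 
--> Mapping trans to shape gives a warning because trans has more levels than the six shapes in the default palette, so the extra levels are not drawn. 
--> Mapping a discrete variable to size (e.g. drv to size) also gives a warning, since size implies an order that an unordered category does not have.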
```{r}
ggplot(mpg, aes(hwy, cty, shape = trans)) + geom_point()
```

```{r}
ggplot(mpg, aes(hwy, cty, size = drv)) + geom_point()

```
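
If all the trans levels really need their own shape, one option is to supply the shape codes by hand with `scale_shape_manual()`. This is only a sketch; the choice of codes `1, 2, 3, ...` is arbitrary, and the result is still hard to read with this many categories.

```{r}
library(ggplot2)

# trans has more levels than the six shapes in the default palette
n_trans <- length(unique(mpg$trans))

p_shape <- ggplot(mpg, aes(hwy, cty, shape = trans)) +
  geom_point() +
  # Supply one shape code per level explicitly (codes 0 to 25 are valid point shapes)
  scale_shape_manual(values = seq_len(n_trans))
p_shape
```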

How is drive train related to fuel economy? How is drive train related to engine size and class?

Front-wheel drive is the most fuel efficient, 4wd the least. 
```{r}
mpg %>% group_by(drv) %>% summarize(avg_hwy = mean(hwy), 
                                    sd_hwy = sd(hwy)) %>% 
  ggplot(aes(drv, avg_hwy)) +
  geom_col() + 
  # width is a fixed setting, not a mapping, so it goes outside aes()
  geom_errorbar(aes(ymin = avg_hwy - sd_hwy, ymax = avg_hwy + sd_hwy), width = .2)
```

Could also do with a boxplot
```{r}
ggplot(mpg, aes(drv, hwy)) + 
  geom_boxplot()
```

How is drive train related to engine size and class?

Rear-wheel-drive cars have the largest engines, front-wheel-drive cars the smallest. SUVs and pickups are largely 4wd. Midsize cars are usually fwd, though some are 4wd.
```{r}
ggplot(mpg, aes(drv, displ, color = class)) +
  geom_point()
```
 
#### Exercises, Section 2.5 
What happens if you try to facet by a continuous variable like hwy? What about cyl? What’s the key difference?

Facet by highway, get too many graphs (one for each value)
```{r}
ggplot(mpg, aes(displ, cty)) +
  geom_point() + 
  facet_wrap(~hwy)
```

Faceting by cyl gives far fewer panels with more points in each, because cyl has only a few unique values. 

```{r}
ggplot(mpg, aes(displ, cty)) +
  geom_point() + 
  facet_wrap(~cyl)
```

Use facetting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does facetting by number of cylinders change your assessment of the relationship between engine size and fuel economy?

--> See above for cty graph, reproduced here for hwy 

Both show that the apparent relationship between engine size and fuel economy (smaller engine size --> better fuel economy) is largely due to differences in the number of cylinders. (The relationship is roughly flat when the number of cylinders is held constant, except among the small 4-cylinder engines.) 

```{r}
ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  facet_wrap(~cyl)
```

Can also see this using aes mapping 
```{r}
ggplot(mpg, aes(displ, hwy, color = as.character(cyl))) + geom_point()

```
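
The visual impression from the facets can be checked roughly by comparing the slope of hwy on displ within each cylinder group against the pooled slope. This is a sketch, not a proper model: it uses the simple least-squares slope `cov(x, y) / var(x)`, and the names `slopes_by_cyl` and `pooled_slope` are made up for illustration.

```{r}
library(dplyr)
library(ggplot2)

# Least-squares slope of hwy on displ within each cylinder count.
# A group with no variation in displ gives NaN (0 / 0).
slopes_by_cyl <- mpg %>%
  group_by(cyl) %>%
  summarize(n = n(),
            slope = cov(displ, hwy) / var(displ))
slopes_by_cyl

# Pooled slope ignoring cyl, for comparison
pooled_slope <- cov(mpg$displ, mpg$hwy) / var(mpg$displ)
pooled_slope
```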


Read the documentation for facet_wrap(). What arguments can you use to control how many rows and columns appear in the output? (nrow and ncol)

What does the scales argument to facet_wrap() do? When might you use it? (scales = "fixed", "free", "free_x", "free_y". This controls whether the axis scales are shared across all panels or set by the data in each panel. "fixed" is the default. Free scales help when the panels cover very different ranges, at the cost of making the panels harder to compare.) 

Note how this changes the above facet display of engine size, cyl, and fuel economy
```{r}
ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  facet_wrap(~cyl, scales = "free")
```
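
The layout and scale arguments can be combined. As a small sketch, the plot below puts all the cylinder panels in one row and frees only the x axis, so hwy stays directly comparable across panels; the object name `p_free_x` is just for illustration.

```{r}
library(ggplot2)

p_free_x <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # one row of panels; x axes vary per panel, y axis stays shared
  facet_wrap(~cyl, nrow = 1, scales = "free_x")
p_free_x
```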

#### Exercises, Section 2.6

What’s the problem with the plot created by ggplot(mpg, aes(cty, hwy)) + geom_point()? Which of the geoms described above is most effective at remedying the problem?
```{r}
ggplot(mpg, aes(cty, hwy)) + geom_point()
```

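Is the problem the overlapping values? cty and hwy are whole numbers, so many cars land on exactly the same point (overplotting). Then we could plot with geom_jitter() instead.
(The problem is not obvious from the plain scatterplot, because stacked points look like a single point.)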
```{r}
ggplot(mpg, aes(cty, hwy)) + geom_jitter()
```
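
To see how much overlap there actually is, one can count the repeated (cty, hwy) pairs and, as an alternative to jittering, size each point by the number of cars at that location with `geom_count()`. This is a sketch; `n_overlapping` is a made-up name.

```{r}
library(ggplot2)

# Number of rows whose (cty, hwy) pair already appeared in an earlier row
n_overlapping <- sum(duplicated(mpg[, c("cty", "hwy")]))
n_overlapping

# geom_count() sizes each point by the number of observations at that location
ggplot(mpg, aes(cty, hwy)) + geom_count()
```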

One challenge with ggplot(mpg, aes(class, hwy)) + geom_boxplot() is that the ordering of class is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative?

```{r}
ggplot(mpg, aes(class, hwy)) + geom_boxplot()
```
 
We can order by mean fuel efficiency, setting the factor levels by the mean value. 
It would be nice if there were some way to do this without calculating the group mean explicitly... 
there is! Let's try fct_reorder().
 
```{r}
mpg %>% 
  mutate(class = fct_reorder(class, hwy, .fun = mean)) %>% 
  ggplot(aes(class, hwy)) + geom_boxplot()
```

Actually, we can do this in one step inside ggplot() with base R's reorder()! Order by the median in this case... 
```{r}
ggplot(mpg, aes(reorder(class, hwy, median), hwy)) + geom_boxplot()
```

Explore the distribution of the carat variable in the diamonds dataset. What binwidth reveals the most interesting patterns?
At smaller binwidths (.25 or .1) can see peak at small carat, and then at each whole carat mark. 
Actually think .1 looks the best. 
```{r}
ggplot(diamonds, aes(carat)) + geom_histogram(binwidth = .1)
```

Explore the distribution of the price variable in the diamonds data. How does the distribution vary by cut?

--> can you set different binwidths for different layered histograms? 
```{r}
diamonds %>% group_by(cut) %>% 
  summarize(avg_price = mean(price), 
            avg_carat = mean(carat), 
            num_records = n())
```
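
The cuts have very different numbers of records, so raw counts are hard to compare. One hedged option is a frequency polygon on the density scale, so each cut's curve has the same area. This sketch assumes ggplot2 3.3.0 or later for `after_stat()`, and the binwidth of 500 is an arbitrary choice.

```{r}
library(ggplot2)

p_price_cut <- ggplot(diamonds, aes(price, after_stat(density), colour = cut)) +
  # density rather than count, so cuts with few records are still visible
  geom_freqpoly(binwidth = 500)
p_price_cut
```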

By the averages, Ideal cut is less expensive than the other cuts (but Ideal diamonds are also smaller on average), so look at price against carat before comparing cuts.

General trend of price vs. carat: price increases with carat, and to some extent with cut, but the relationship is certainly noisy. 

```{r}
ggplot(diamonds, aes(carat, price, color = cut)) + 
  geom_point() + 
  geom_smooth()
```

Let's look at a violin plot. First of price x cut 
```{r}
ggplot(diamonds, aes(cut, price)) + 
  geom_violin()

```

And carat x cut 

```{r}
ggplot(diamonds, aes(cut, carat)) + 
  geom_violin()

```

Try a facet plot to look at all 3 at once. 
```{r}
ggplot(diamonds, aes(carat, price)) +
  geom_point() + 
  facet_wrap(~cut)
```

You now know (at least) three ways to compare the distributions of subgroups: geom_violin(), geom_freqpoly() and the colour aesthetic, or geom_histogram() and facetting. What are the strengths and weaknesses of each approach? What other approaches could you try?

- Violin plot: compact and clean, can only look at 1 variable at a time. Shapes are a complicated abstraction and not suitable for detailed scrutiny. 
- Frequency polygon: compact to look at more than 1 subgroup at a time. Interpretation very dependent on binwidth. 
- Histogram + facetting: Can be hard to compare across all subgroups. Interpretation dependent on binwidth. 


Read the documentation for geom_bar(). What does the weight aesthetic do? The weight aesthetic calculates a sum of weights instead of a subgroup count. 
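
A small sketch of the weight aesthetic: weighting by displ makes each bar the total displacement per manufacturer rather than the number of rows, which can be checked against a direct sum. The name `p_weight` and the choice of displ as the weight are only for illustration.

```{r}
library(ggplot2)

# Bar height = total displ per manufacturer rather than number of rows
p_weight <- ggplot(mpg, aes(manufacturer)) +
  geom_bar(aes(weight = displ)) +
  ylab("total displ")
p_weight

# The same totals computed directly, for comparison
displ_totals <- tapply(mpg$displ, mpg$manufacturer, sum)
displ_totals
```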

Using the techniques already discussed in this chapter, come up with three ways to visualise a 2d categorical distribution. Try them out by visualising the distribution of model and manufacturer, trans and class, and cyl and trans.

1) Model and manufacturer 
-- I still don't really understand what this is. Again, going to use count of models, as above.  
```{r}
# Grouping by clean_model first has no effect: the second group_by() replaces it
mpg_data %>% group_by(manufacturer) %>% summarize(nmodels = n_distinct(clean_model)) %>% 
  ggplot(aes(manufacturer, nmodels)) +
  geom_col()
```

2) trans and class
```{r}
ggplot(mpg_data, aes(class)) +
  geom_bar() +
  facet_wrap(~trans)

```

```{r}
ggplot(mpg_data, aes(class, fill = trans)) +
  geom_bar() 
```


3) cyl and trans
```{r}
ggplot(mpg_data, aes(trans, cyl, color = trans)) +
  geom_jitter()
```
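
Another way to show a 2d categorical distribution is to count each combination first and draw the counts as a heatmap. This is a sketch of that approach for cyl and trans; the name `trans_cyl` is made up, and combinations that never occur are simply left blank.

```{r}
library(dplyr)
library(ggplot2)

# Count cars in each trans/cyl combination
trans_cyl <- mpg %>% count(trans, cyl)

# One tile per combination, filled by the number of cars
ggplot(trans_cyl, aes(trans, factor(cyl), fill = n)) +
  geom_tile() +
  labs(y = "cyl")
```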

